Erasmus is a European Union programme that allows students to study at a foreign university through an exchange. Every student weighs many factors when choosing a destination. In this final project we used a dataset provided by the European Union to explore whether there is any relationship between student characteristics such as age, gender and nationality and the destination chosen for Erasmus. We also provide several plots to understand the dataset better. Finally, we created a machine learning model that predicts a student's destination.
The dataset we used covers the 2012-2013 academic year and can be found here. It is published directly by the European Union. It was created from the statistical reports of the national agencies of the 33 countries participating in the Erasmus+ programme (Erasmus decentralised actions) and from data provided by the Education, Audiovisual and Culture Executive Agency (Erasmus centralised actions). The data is generated during the student's application process and then collected by the respective universities. It contains 267547 observations and 34 variables.
The host institution country is one of the most interesting variables to us. It has a lot of undefined values, around 55 thousand, so we need to filter those out. For both host and home country, values are stored as country codes. However, Belgium is coded as three different values, “BEDE”, “BEFR” and “BENL”, depending on the language area (German, French or Dutch). We merge all of these values into a single one for the whole of Belgium.
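The two cleaning steps described above can be sketched in R as follows. This is a minimal illustration on a made-up data frame, not our full preprocessing script. The column name comes from the model formula used later in this report; the encoding of undefined values as NA or "???" and the merged code "BE" are assumptions made for the example.

```r
# Toy data standing in for the real Erasmus 2012-2013 dataset.
erasmus <- data.frame(
  HOST_INSTITUTION_COUNTRY_CDE = c("BEDE", "BEFR", "BENL", "ES", "???", NA, "UK"),
  stringsAsFactors = FALSE
)

# Assumption: undefined host countries appear as NA or as the placeholder "???".
undefined <- is.na(erasmus$HOST_INSTITUTION_COUNTRY_CDE) |
  erasmus$HOST_INSTITUTION_COUNTRY_CDE == "???"
filtered <- erasmus[!undefined, , drop = FALSE]

# Collapse the three Belgian language-area codes into one code.
# "BE" as the merged code is an assumption for this sketch.
merge_belgium <- function(codes) {
  ifelse(codes %in% c("BEDE", "BEFR", "BENL"), "BE", codes)
}

filtered$HOST_INSTITUTION_COUNTRY_CDE <-
  merge_belgium(filtered$HOST_INSTITUTION_COUNTRY_CDE)

print(table(filtered$HOST_INSTITUTION_COUNTRY_CDE))
```

The same helper can be applied to the home country column, since both columns use the same country codes.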
There are 34 different variables and we are not going to use all of them, so we list the ones that are most relevant for our research:
The first thing we wanted to explore was whether there is a difference between the number of males and females enrolled in Erasmus. We expected to see a significant difference, as one of the cited papers suggests that there is a gender gap. The pie chart presented here confirms this assumption.
Next we wanted to see which countries send the most students on Erasmus. Rather than just listing them, we present this metric on a map of Europe, colouring each country according to the number of students whose home university is in that country. We can see that Spain, France and Germany lead in the number of students enrolled in Erasmus. Surprisingly, Turkey also ranks very high.
We were also interested in the subject areas in which Erasmus is most popular. The dataset contains a code for each area, and we used the International Standard Classification of Education to map those codes to area names. We also merged areas whose codes start with the same two digits, since those are related, and finally displayed the statistic as a bar plot.
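The merging of subject-area codes by their first two digits can be sketched as below. The sample codes and the small lookup table are illustrative only; the three area names are given as we understand them from the ISCED fields of education and should be checked against the official classification.

```r
# Illustrative subject-area codes as character strings (made-up sample).
area_codes <- c("222", "223", "314", "34", "1", "481")

# Keep only the first two characters, so related areas share one group.
# Codes shorter than two characters are left unchanged.
area_group <- substr(area_codes, 1, 2)

# Hypothetical partial lookup from group code to area name.
area_names <- c("22" = "Humanities",
                "34" = "Business and administration",
                "48" = "Computing")

# Groups missing from the lookup get NA and would need a manual mapping.
area_label <- unname(area_names[area_group])

# Counts per group, which is what the bar plot displays.
group_counts <- table(area_group)
print(group_counts)
```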
To explore the data further, we looked at the age distribution. At that point we noticed a student who attended Erasmus at the age of 93. There were some other unusual records, such as students aged 73 and 69. We therefore present the distribution only up to the age of 30, which covers most of the students. The most frequent age was 22 among males and 21 among females. On this plot we can also see that there are more female students in nearly every age category.
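A minimal sketch of how the age cut-off and the most frequent age per gender can be computed is shown below. The data frame is made up for illustration; the column names are the ones used in our model formula, and the gender codes "F" and "M" are assumed.

```r
# Toy sample, including one implausible age to show the effect of the cut-off.
students <- data.frame(
  STUDENT_AGE_VALUE  = c(21, 21, 22, 22, 22, 93, 21, 20),
  STUDENT_GENDER_CDE = c("F", "F", "M", "M", "M", "M", "F", "M")
)

# Keep only students aged 30 or younger for the plot.
plot_data <- students[students$STUDENT_AGE_VALUE <= 30, ]

# Most frequent age within each gender group.
most_frequent_age <- sapply(
  split(plot_data$STUDENT_AGE_VALUE, plot_data$STUDENT_GENDER_CDE),
  function(ages) as.numeric(names(which.max(table(ages))))
)
print(most_frequent_age)
```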
The last thing we wanted to explore was the 10 most popular universities in Europe among students. This is a simple bar plot that shows the universities and the number of Erasmus students enrolled at each. Sweden leads with universities in Stockholm and Linköping, while third place belongs to a university in Valencia.
We want host country as our outcome variable and want to see how the other variables relate to it. There are 34 different variables, but not all of them make sense to include in the model. After exploring the dataset we decided that we need just a few of them. Here is the formula of our model:
HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE
We also explain why we included each variable:
The first thing that comes to mind when talking about the strength of relationships is a linear model, fitted with the function lm(). However, our outcome is categorical rather than continuous, so this is not a linear regression problem and we cannot use this function. Our next option is logistic regression, which handles a categorical outcome. The only problem here is that logistic regression usually assumes a binary outcome, while we have multiple classes. Precisely, since host country is our dependent variable, we have as many categories as there are countries in that column. For the 2012-2013 dataset there are 33 countries, and that is how many classes we have. This is where a multinomial model, which supports any number of classes, comes in handy. We use the multinom() function from the nnet package and specify the data, the formula, the maximum number of weights and the maximum number of iterations. Finally we created the model in R with the following command:
library(nnet)
model <- multinom(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE, data = filtered, MaxNWts = 3000, maxit = 20)
We set the limits so that the model may have at most 3000 weights (MaxNWts) and runs for at most 20 iterations (maxit).
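To show what such a fit returns, here is a small self-contained sketch that uses the built-in iris dataset as a stand-in for our data (its Species column has three classes instead of 33). The arguments mirror the ones we used; trace = FALSE only silences the iteration log. The number of weights grows with the number of outcome classes and the number of model terms, which is why a 33-class outcome with many dummy-coded predictors needs a raised MaxNWts.

```r
library(nnet)

# Stand-in data: iris has a 3-class outcome, our dataset has 33 classes.
fit <- multinom(Species ~ Sepal.Length + Sepal.Width,
                data = iris, MaxNWts = 3000, maxit = 20, trace = FALSE)

# One row of coefficients per non-reference class, one column per model term.
coef_matrix <- coef(fit)
print(dim(coef_matrix))

# Predicted class probabilities (one column per class) and predicted classes.
probs <- predict(fit, type = "probs")
predicted <- predict(fit, type = "class")
print(head(predicted))
```

With only 20 iterations the optimiser may stop before it has converged, so the estimates from such a run should be read with some caution.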
Even though we managed to create this model, calculating its summary did not finish in a reasonable time, so we had to take another approach. For this reason alone we reduced our dataset so that it has only two outcome categories, UK and ES, and built a model with only those two classes. Now we can apply a logistic regression model, since the outcome is binary. The model is created with the following command:
model <- glm(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE, data = filtered, family = binomial())
This calculation is much faster, so we can explore the strength of the relationships properly.
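The reduction to two outcome classes can be sketched on simulated data as follows. The data frame, its values and the names toy, two_class and toy_model are hypothetical; only the column names are taken from our formula. One detail worth noting is that glm() with a factor outcome treats the first factor level as "failure" and the remaining level as "success", so with the levels ordered ES, UK the coefficients describe the log-odds of going to the UK.

```r
set.seed(1)
n <- 300

# Simulated stand-in data with three host countries.
toy <- data.frame(
  HOST_INSTITUTION_COUNTRY_CDE = factor(sample(c("ES", "UK", "DE"), n, replace = TRUE)),
  STUDENT_AGE_VALUE            = sample(19:26, n, replace = TRUE),
  STUDENT_GENDER_CDE           = factor(sample(c("F", "M"), n, replace = TRUE))
)

# Keep only the two outcome classes and drop the now-unused factor level.
keep <- toy$HOST_INSTITUTION_COUNTRY_CDE %in% c("ES", "UK")
two_class <- droplevels(toy[keep, ])

# Binary logistic regression on the reduced data.
toy_model <- glm(HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_AGE_VALUE + STUDENT_GENDER_CDE,
                 data = two_class, family = binomial())

# First level is the reference ("failure"), second level is modelled ("success").
print(levels(two_class$HOST_INSTITUTION_COUNTRY_CDE))
```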
The summary of our logistic regression model is presented below:
Call:
glm(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE +
STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE,
family = binomial(), data = filtered)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8644 -0.8695 -0.5691 1.0946 3.1778
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.065424 0.559370 -0.117 0.906892
STUDENT_NATIONALITY_CDEBE -0.510193 0.089534 -5.698 1.21e-08 ***
STUDENT_NATIONALITY_CDEBG -0.201445 0.149165 -1.350 0.176862
STUDENT_NATIONALITY_CDECH 0.195356 0.115699 1.688 0.091320 .
STUDENT_NATIONALITY_CDECY -0.082228 0.242261 -0.339 0.734295
STUDENT_NATIONALITY_CDECZ 0.298433 0.097206 3.070 0.002140 **
STUDENT_NATIONALITY_CDEDE -0.015585 0.073135 -0.213 0.831253
STUDENT_NATIONALITY_CDEDK 1.006396 0.107310 9.378 < 2e-16 ***
STUDENT_NATIONALITY_CDEEE 0.249332 0.197660 1.261 0.207159
STUDENT_NATIONALITY_CDEES 4.032322 0.126069 31.985 < 2e-16 ***
STUDENT_NATIONALITY_CDEFI 0.720312 0.096938 7.431 1.08e-13 ***
STUDENT_NATIONALITY_CDEFR 0.380214 0.073581 5.167 2.38e-07 ***
STUDENT_NATIONALITY_CDEGR -0.546220 0.120104 -4.548 5.42e-06 ***
STUDENT_NATIONALITY_CDEHR -0.792309 0.251065 -3.156 0.001601 **
STUDENT_NATIONALITY_CDEHU -0.139422 0.127126 -1.097 0.272765
STUDENT_NATIONALITY_CDEIE -0.780883 0.133951 -5.830 5.55e-09 ***
STUDENT_NATIONALITY_CDEIS 0.167390 0.253194 0.661 0.508539
STUDENT_NATIONALITY_CDEIT -0.972142 0.075588 -12.861 < 2e-16 ***
STUDENT_NATIONALITY_CDELI -10.305356 84.438362 -0.122 0.902863
STUDENT_NATIONALITY_CDELT -0.453280 0.153101 -2.961 0.003070 **
STUDENT_NATIONALITY_CDELU -0.280322 0.391914 -0.715 0.474446
STUDENT_NATIONALITY_CDELV -1.072289 0.245016 -4.376 1.21e-05 ***
STUDENT_NATIONALITY_CDEMT 2.796590 0.416106 6.721 1.81e-11 ***
STUDENT_NATIONALITY_CDENL 0.574346 0.086693 6.625 3.47e-11 ***
STUDENT_NATIONALITY_CDENO 1.036971 0.125072 8.291 < 2e-16 ***
STUDENT_NATIONALITY_CDEPL -0.978837 0.087812 -11.147 < 2e-16 ***
STUDENT_NATIONALITY_CDEPT -1.108624 0.106818 -10.379 < 2e-16 ***
STUDENT_NATIONALITY_CDERO -0.956060 0.133953 -7.137 9.52e-13 ***
STUDENT_NATIONALITY_CDESE 1.009450 0.097950 10.306 < 2e-16 ***
STUDENT_NATIONALITY_CDESI -0.754328 0.171553 -4.397 1.10e-05 ***
STUDENT_NATIONALITY_CDESK -0.491452 0.135884 -3.617 0.000298 ***
STUDENT_NATIONALITY_CDETR -0.847052 0.104855 -8.078 6.57e-16 ***
STUDENT_NATIONALITY_CDEUK -4.049174 0.226146 -17.905 < 2e-16 ***
STUDENT_AGE_VALUE -0.004137 0.005066 -0.817 0.414192
STUDENT_SUBJECT_AREA_VALUE1 -0.915442 0.611400 -1.497 0.134318
STUDENT_SUBJECT_AREA_VALUE10 -0.049669 0.705469 -0.070 0.943871
STUDENT_SUBJECT_AREA_VALUE14 -0.312324 0.546546 -0.571 0.567695
STUDENT_SUBJECT_AREA_VALUE2 0.337524 0.714199 0.473 0.636505
STUDENT_SUBJECT_AREA_VALUE21 0.076313 0.545027 0.140 0.888647
STUDENT_SUBJECT_AREA_VALUE22 -0.175862 0.543294 -0.324 0.746168
STUDENT_SUBJECT_AREA_VALUE3 0.169626 0.553877 0.306 0.759412
STUDENT_SUBJECT_AREA_VALUE31 -0.609875 0.543833 -1.121 0.262101
STUDENT_SUBJECT_AREA_VALUE32 -0.872924 0.547677 -1.594 0.110966
STUDENT_SUBJECT_AREA_VALUE34 -0.735747 0.543484 -1.354 0.175813
STUDENT_SUBJECT_AREA_VALUE38 -0.155832 0.544337 -0.286 0.774665
STUDENT_SUBJECT_AREA_VALUE4 0.179354 0.756269 0.237 0.812535
STUDENT_SUBJECT_AREA_VALUE42 -0.036296 0.548599 -0.066 0.947250
STUDENT_SUBJECT_AREA_VALUE44 0.010724 0.546164 0.020 0.984334
STUDENT_SUBJECT_AREA_VALUE46 0.179673 0.551668 0.326 0.744658
STUDENT_SUBJECT_AREA_VALUE48 -0.159184 0.549616 -0.290 0.772102
STUDENT_SUBJECT_AREA_VALUE5 -1.524772 0.651268 -2.341 0.019220 *
STUDENT_SUBJECT_AREA_VALUE52 -0.423912 0.544501 -0.779 0.436254
STUDENT_SUBJECT_AREA_VALUE54 -0.465939 0.559786 -0.832 0.405210
STUDENT_SUBJECT_AREA_VALUE58 -0.970780 0.545905 -1.778 0.075356 .
STUDENT_SUBJECT_AREA_VALUE6 -1.042265 0.592845 -1.758 0.078735 .
STUDENT_SUBJECT_AREA_VALUE62 -0.988229 0.557883 -1.771 0.076496 .
STUDENT_SUBJECT_AREA_VALUE64 -2.723809 0.685860 -3.971 7.15e-05 ***
STUDENT_SUBJECT_AREA_VALUE72 -1.195239 0.546208 -2.188 0.028651 *
STUDENT_SUBJECT_AREA_VALUE76 -0.523986 0.563750 -0.929 0.352648
STUDENT_SUBJECT_AREA_VALUE8 0.368579 1.075075 0.343 0.731718
STUDENT_SUBJECT_AREA_VALUE81 -1.162951 0.547930 -2.122 0.033801 *
STUDENT_SUBJECT_AREA_VALUE84 -1.123646 0.681346 -1.649 0.099116 .
STUDENT_SUBJECT_AREA_VALUE85 -1.171075 0.645957 -1.813 0.069842 .
STUDENT_SUBJECT_AREA_VALUE86 1.129102 1.224562 0.922 0.356505
STUDENT_SUBJECT_AREA_VALUE90 -10.391941 84.478372 -0.123 0.902097
STUDENT_SUBJECT_AREA_VALUE99 0.229852 0.581121 0.396 0.692450
STUDENT_GENDER_CDEM 0.150036 0.023913 6.274 3.51e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 63615 on 48489 degrees of freedom
Residual deviance: 51770 on 48423 degrees of freedom
AIC: 51904
In the rightmost column we see the p-values together with an indicator of the significance of each independent variable. Age is not a statistically significant predictor, since its p-value (0.41) is well above 0.05. Gender, however, has a very small p-value and is therefore a significant predictor. For study area, most categories are not significant at the 0.05 level, with a few exceptions (codes 5, 64, 72 and 81), so this variable plays only a minor role in estimating the host country. Finally, it is interesting that most of the nationality categories are significant, so we can say that nationality is associated with the dependent variable.
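Besides the p-values, the size of each coefficient can be read as a change in log-odds, and exponentiating it gives an odds ratio. The sketch below does this for a few estimates copied from the summary above. It assumes that the modelled outcome is UK (the second factor level when the levels are ordered ES, UK), which is consistent with the large positive coefficient for Spanish nationals and the large negative one for UK nationals.

```r
# Estimates and one standard error copied from the summary output above.
coefs <- c(STUDENT_GENDER_CDEM       =  0.150036,
           STUDENT_NATIONALITY_CDEES =  4.032322,
           STUDENT_NATIONALITY_CDEUK = -4.049174,
           STUDENT_AGE_VALUE         = -0.004137)
gender_se <- 0.023913

# Odds ratios: values above 1 raise the odds of the modelled outcome,
# values below 1 lower them.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Approximate 95% Wald confidence interval for the gender odds ratio.
z <- qnorm(0.975)
gender_ci <- exp(coefs[["STUDENT_GENDER_CDEM"]] + c(-1, 1) * z * gender_se)
print(round(gender_ci, 3))
```

Under that assumption, male students have roughly 16% higher odds of choosing the UK over Spain than female students, while the odds ratio for age is very close to 1. On a fitted model object the same numbers can be obtained with exp(coef(model)).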
R squared is the usual measure of the share of variance explained by a model. A logistic regression model is fitted by maximum likelihood and therefore does not minimise squared error. For that reason R squared is not reported in the summary. However, we can use the following formula to get a sense of how much of the variation is explained:
1-(model$deviance/model$null.deviance)
Subtracting the ratio of residual deviance to null deviance from 1 gives a pseudo R squared, which in our case is around 18.6% (1 - 51770/63615). We can conclude that the model explains the variation in the outcome poorly.
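The arithmetic behind this figure can be reproduced directly from the two deviance values printed in the summary. The helper name pseudo_r_squared is ours and is used only for this illustration.

```r
# Deviance-based pseudo R squared: the share of the null deviance
# that the fitted model removes.
pseudo_r_squared <- function(residual_deviance, null_deviance) {
  1 - residual_deviance / null_deviance
}

# Values taken from the summary output above.
null_deviance     <- 63615
residual_deviance <- 51770

r2 <- pseudo_r_squared(residual_deviance, null_deviance)
print(round(r2, 3))
```

On a fitted glm object the two inputs are available as its deviance and null.deviance components.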